
Conversation


@Alcpz Alcpz commented Nov 5, 2025

While testing #16739, perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.

LFM2 has some layers with two batches, so the MUL_MATs were only partially computed, leading to incorrect results. See #16739 (comment)

This patch adds basic support for tensors with ne2 > 1, using very naive chunking based on the non-repack MUL_MAT.
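In rough terms, the change iterates over the extra batch dimension and runs the existing 2-D repack path on each slice, mirroring how the non-repack MUL_MAT broadcasts src0 over src1. A minimal sketch of that idea (not the literal patch; names follow ggml conventions, and repack_gemm_2d is a hypothetical stand-in for the existing per-slice kernel):

// naive batching: run the existing 2-D repack path once per batch slice i12
const int64_t r2 = ne12 / ne02; // how many src1 batches broadcast onto one src0 batch

for (int64_t i12 = 0; i12 < ne12; ++i12) {
    const int64_t i02 = i12 / r2;

    const char * src0_ptr = (const char *) src0->data    + i02*nb02;
    const char * src1_ptr = (const char *) params->wdata + i12*ne11*src1_col_stride; // packed rows of this slice
    char       * dst_ptr  = (char *)       dst->data     + i12*nb2;

    // the per-slice work is then chunked across threads as before
    repack_gemm_2d(params, dst_ptr, src0_ptr, src1_ptr, ne01, ne11);
}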

Perplexities using this patch:

# REPACK ON
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
# REPACK OFF
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198

I can provide logs for other models if needed.

Repro commands:

# GGML_CPU_REPACK=ON|OFF GGML_BLAS=OFF GGML_METAL=OFF

for model in unsloth/Qwen3-8B-128K-GGUF:Q4_0 LiquidAI/LFM2-1.2B-GGUF:Q4_0 LiquidAI/LFM2-2.6B-GGUF:Q4_0; do
  ./bin/llama-perplexity -hf "$model" -f ./wikitext-2-raw/wiki.test.raw --chunks 100 -dev none
done

Other models:

# Qwen 3 REPACK ON
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427

# Qwen 3 REPACK OFF
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427

# LFM2 REPACK ON
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849

# LFM2 REPACK OFF
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849

@Alcpz Alcpz changed the title ggml-cpu: handle 3d tensors in repack mul_mat ggml-cpu: handle 3d tensors in repack mat_mul Nov 5, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 5, 2025
@Alcpz Alcpz force-pushed the Alcpz/batched_repack_mul_mat branch from eadb483 to 950671d on November 5, 2025 18:03
@Alcpz Alcpz marked this pull request as draft November 5, 2025 18:46
@Alcpz Alcpz marked this pull request as ready for review November 6, 2025 12:24

Alcpz commented Nov 10, 2025

@ggerganov This is ready for review now. Thanks for your patience.

Comment on lines +1629 to +1633

const char * src0_ptr = (const char *) src0->data + i02 * nb02;
const char * src1_ptr = (const char *) params->wdata + (i11 + i12 * ne11) * src1_col_stride;
char * dst_ptr = ((char *) dst->data + (i1 * nb1 + i2 * nb2));

Member

Add GGML_ASSERT here that guarantees we are within bounds of [params->wdata, params->wdata + params->wsize)

Contributor Author

Added one GGML_ASSERT for the upper bound. The lower bound is always satisfied: as long as ne1 and ne11 are >= 1, i11 and i12 are non-negative, so the pointer never falls below params->wdata.
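For illustration, such an upper-bound check could look roughly like this (a sketch; the exact expression in the patch may differ slightly):

// the packed src1 column we are about to read must stay inside the scratch buffer
GGML_ASSERT(src1_ptr + src1_col_stride <= (const char *) params->wdata + params->wsize);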

@Alcpz Alcpz force-pushed the Alcpz/batched_repack_mul_mat branch from 5a202b9 to d1938ad on November 10, 2025 20:26

Alcpz commented Nov 12, 2025

@ggerganov I've addressed all your comments. Let me know if something else is required.

@ggerganov ggerganov merged commit 1c398dc into ggml-org:master Nov 12, 2025
71 checks passed

max-krasnyansky commented Nov 13, 2025

@Alcpz
This PR causes a significant performance regression in prompt processing because it creates many more chunks than before.

Here is llama3.2-1B-Q4_0 running with 6 threads and instrumented matmul code.

The instrumentation simply counts the number of chunks processed and the time per thread:
repack-chunking-inst.diff.txt
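For context, this kind of per-thread instrumentation can look roughly like the following (a hypothetical sketch, not the attached diff; ggml_time_us, params->ith, and dst->name are standard ggml helpers/fields):

// around the per-thread chunk loop of the mul_mat:
const int64_t t_start = ggml_time_us(); // taken before the chunk loop
int nchunks = 0;                        // incremented once per chunk processed
// ... existing chunk loop, with nchunks++ in its body ...
fprintf(stderr, "thread-%d: %s nchunks %d usec %lld\n",
        params->ith, dst->name, nchunks, (long long) (ggml_time_us() - t_start));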

After this PR                                 Before this PR
thread-2: Qcur-11 nchunks 38 usec 1844        thread-4: Qcur-11 nchunks 6 usec 1496
thread-3: Qcur-11 nchunks 38 usec 1844        thread-0: Qcur-11 nchunks 6 usec 1498
thread-4: Qcur-11 nchunks 17 usec 1874        thread-5: Qcur-11 nchunks 3 usec 1597
thread-5: Qcur-11 nchunks 17 usec 1948        thread-1: Qcur-11 nchunks 3 usec 1640
thread-1: Qcur-11 nchunks 17 usec 1894        thread-2: Qcur-11 nchunks 3 usec 1685
thread-0: Qcur-11 nchunks 17 usec 1876        thread-3: Qcur-11 nchunks 3 usec 1718
thread-4: Vcur-11 nchunks 17 usec 607         thread-5: Vcur-11 nchunks 6 usec 508
thread-5: Vcur-11 nchunks 17 usec 638         thread-4: Vcur-11 nchunks 6 usec 515
thread-2: Vcur-11 nchunks 39 usec 617         thread-0: Vcur-11 nchunks 3 usec 547
thread-1: Vcur-11 nchunks 15 usec 618         thread-2: Vcur-11 nchunks 3 usec 548
thread-0: Vcur-11 nchunks 17 usec 630         thread-1: Vcur-11 nchunks 3 usec 564
thread-3: Vcur-11 nchunks 39 usec 617         thread-3: Vcur-11 nchunks 3 usec 596
thread-5: Kcur-11 nchunks 38 usec 611         thread-5: Kcur-11 nchunks 6 usec 484
thread-1: Kcur-11 nchunks 17 usec 615         thread-0: Kcur-11 nchunks 6 usec 490
thread-0: Kcur-11 nchunks 17 usec 617         thread-1: Kcur-11 nchunks 3 usec 547
thread-2: Kcur-11 nchunks 17 usec 628         thread-3: Kcur-11 nchunks 3 usec 548
thread-4: Kcur-11 nchunks 38 usec 611         thread-4: Kcur-11 nchunks 3 usec 557
thread-3: Kcur-11 nchunks 17 usec 649         thread-2: Kcur-11 nchunks 3 usec 547
thread-3: attn_out-11 nchunks 38 usec 1835    thread-4: attn_out-11 nchunks 6 usec 1567
thread-5: attn_out-11 nchunks 38 usec 1847    thread-5: attn_out-11 nchunks 6 usec 1569
thread-0: attn_out-11 nchunks 17 usec 1880    thread-1: attn_out-11 nchunks 3 usec 1637
thread-4: attn_out-11 nchunks 17 usec 1886    thread-2: attn_out-11 nchunks 3 usec 1639
thread-1: attn_out-11 nchunks 17 usec 1890    thread-3: attn_out-11 nchunks 3 usec 1642
thread-2: attn_out-11 nchunks 17 usec 1897    thread-0: attn_out-11 nchunks 3 usec 1649
thread-3: ffn_gate-11 nchunks 38 usec 4886    thread-5: ffn_gate-11 nchunks 6 usec 4103
thread-2: ffn_gate-11 nchunks 38 usec 4887    thread-4: ffn_gate-11 nchunks 6 usec 4141
thread-5: ffn_gate-11 nchunks 17 usec 4992    thread-0: ffn_gate-11 nchunks 3 usec 4298
thread-1: ffn_gate-11 nchunks 17 usec 5010    thread-1: ffn_gate-11 nchunks 3 usec 4357
thread-4: ffn_gate-11 nchunks 17 usec 5010    thread-2: ffn_gate-11 nchunks 3 usec 4373
thread-0: ffn_gate-11 nchunks 17 usec 5032    thread-3: ffn_gate-11 nchunks 3 usec 4447
thread-5: ffn_up-11 nchunks 38 usec 4908      thread-0: ffn_up-11 nchunks 6 usec 4107
thread-3: ffn_up-11 nchunks 38 usec 4909      thread-5: ffn_up-11 nchunks 6 usec 4129
thread-4: ffn_up-11 nchunks 17 usec 5000      thread-1: ffn_up-11 nchunks 3 usec 4362
thread-0: ffn_up-11 nchunks 17 usec 5005      thread-4: ffn_up-11 nchunks 3 usec 4377
thread-1: ffn_up-11 nchunks 17 usec 5008      thread-3: ffn_up-11 nchunks 3 usec 4400
thread-2: ffn_up-11 nchunks 17 usec 5037      thread-2: ffn_up-11 nchunks 3 usec 4381
thread-5: ffn_out-11 nchunks 38 usec 4924     thread-5: ffn_out-11 nchunks 6 usec 4089
thread-4: ffn_out-11 nchunks 38 usec 4928     thread-0: ffn_out-11 nchunks 6 usec 4089
thread-1: ffn_out-11 nchunks 17 usec 5006     thread-3: ffn_out-11 nchunks 3 usec 4386
thread-2: ffn_out-11 nchunks 17 usec 5010     thread-2: ffn_out-11 nchunks 3 usec 4386
thread-0: ffn_out-11 nchunks 17 usec 5011     thread-4: ffn_out-11 nchunks 3 usec 4414
thread-3: ffn_out-11 nchunks 17 usec 5023     thread-1: ffn_out-11 nchunks 3 usec 4391 

That's way too many chunks, and we burn a lot of time on synchronization.
If you have an idea for a quick fix that you can test on LFM2, please start another PR and I'll verify it on my setup.
Make sure to test with Llama3.2 and Qwen3 models with the instrumented code.


Alcpz commented Nov 13, 2025

Mmm. Let's revert this then. I will reopen a PR from the branch as a draft so we can work out a better solution. I'd rather not introduce a regression upstream. @ggerganov Mind doing the revert?
